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Abstract 

This paper describes the National Research Coun- 
cil (NRC) Word Sense Disambiguation (WSD) sys- 
tem, as applied to the English Lexical Sample (ELS) 
task in Senseval-3. The NRC system approach- 
es WSD as a classical supervised machine learn- 
ing problem, using familiar tools such as the Weka 
machine learning software and Brill's rule-based 
part-of-speech tagger. Head words are represent- 
ed as feature vectors with several hundred features. 
Approximately half of the features are syntactic and 
the other half are semantic. The main novelty in the 
system is the method for generating the semantic 
features, based on word co-occurrence probabilities. 
The probabilities are estimated using the Waterloo 
MultiText System with a corpus of about one ter- 
abyte of unlabeled text, collected by a web crawler. 

1 Introduction 

The Senseval-3 English Lexical Sample (ELS) task 
requires disambiguating 57 words, with an average 
of roughly 140 training examples and 70 testing 
examples of each word. Each example is about a 
paragraph of text, in which the word that is to be dis- 
ambiguated is marked as the head word. The aver- 
age head word has around six senses. The training 
examples are manually classified according to the 
intended sense of the head word, inferred from the 
surrounding context. The task is to use the training 
data and any other relevant information to automat- 
ically assign classes to the testing examples. 

This paper presents the National Research Coun- 
cil (NRC) Word Sense Disambiguation (WSD) 
system, which generated our four entries for 
the Senseval-3 ELS task (NRC-Fine, NRC-Fine2, 
NRC-Coarse, and NRC-Coarse2). Our approach to 
the ELS task is to treat it as a classical supervised 
machine learning problem. Each example is repre- 
sented as a feature vector with several hundred fea- 
tures. Each of the 57 ambiguous words is represent- 
ed with a different set of features. Typically, around 
half of the features are syntactic and the other half 



are semantic. After the raw examples are convert- 
ed to feature vectors, the Weka machine learning 
software is used to induce a model of the training 
data and predict the classes of the testing examples 
dWitten and Frank, 1999) . 

The syntactic features are based on part-of- 
speech tags, assigned by a rule-based tagger 
flBrill, 1994 . The main innovation of the NRC 
WSD system is the method for generating 
the semantic features, which are derived from 
word co-occurrence probabilities. We estimated 
these probabilities using the Waterloo MultiText 
System with a corpus of about one terabyte 
of unlabeled text, collected by a web crawler 
dClarke et al.,"T995l [Oarke and Cormack, 2000[ 
ITerra and Clarke, 2003| i. 

In SectionEl we describe the NRC WSD system. 
Our experimental results are presented in Section |3] 
and we conclude in Section |4] 

2 System Description 

This section presents various aspects of the system 
in roughly the order in which they are executed. The 
following definitions will simplify the description. 
Head Word: One of the 57 words that are to be 
disambiguated. 

Example: One or more contiguous sentences, illus- 
trating the usage of a head word. 
Context: The non-head words in an example. 
Feature: A property of a head word in a context. 
For instance, the feature tag_hpl_NNP is the prop- 
erty of having (or not having) a proper noun (NNP 
is the part-of-speech tag for a proper noun) immedi- 
ately following the head word (hpl represents the 
location head plus one). 

Feature Value: Features have values, which 
depend on the specific example. For instance, 
tag_hpl_NNP is a binary feature that has the value 
1 {true: the following word is a proper noun) or 
{false: the following word is not a proper noun). 
Feature Vector: Each example is represented by 
a vector. Features are the dimensions of the vector 



space and a vector of feature values specifies a point 
in the feature space. 

2.1 Preprocessing 

The NRC WSD system first assigns part-of-speech 
tags to the words in a given example (Brill, 1994), 
and then extracts a nine-word window of tagged 
text, centered on the head word (i.e., four words 
before and after the head word). Any remaining 
words in the example are ignored (usually most of 
the example is ignored). The window is not allowed 
to cross sentence boundaries. If the head word 
appears near the beginning or end of the sentence, 
where the window may overlap with adjacent sen- 
tences, special null characters fill the positions of 
any missing words in the window. 

In rare cases, a head word appears more than once 
in an example. In such cases, the system selects 
a single window, giving preference to the earliest 
occurring window with the least nulls. Thus each 
example is converted into one nine-word window of 
tagged text. Windows from the training examples 
for a given head word are then used to build the fea- 
ture set for that head word. 

2.2 Syntactic Features 

Each head word has a unique set of feature names, 
describing how the feature values are calculated. 
Feature Names: Every syntactic feature has a name 
of the form matchtype jpositionjnodel. There are 
three matchtypes, ptag, tag, and word, in order 
of increasingly strict matching. A ptag match is 
a partial tag match, which counts similar part-of- 
speech tags, such as NN (singular noun), NNS (plu- 
ral noun), NNP (singular proper noun), and NNPS 
(plural proper noun), as equivalent. A tag match 
requires exact matching in the part-of-speech tags 
for the word and the model. A word match requires 
that the word and the model are exactly the same, 
letter-for-letter, including upper and lower case. 

There are five positions, hm2 (head minus two), 
hml (head minus one), hdO (head), hpl (head plus 
one), and hp 2 (head plus two). Thus syntactic fea- 
tures use only a five-word sub-window of the nine- 
word window. 

The syntactic feature names for a head word 
are generated by all of the possible legal combina- 
tions of matchtype, position, and model. For ptag 
names, the model can be any partial tag. For tag 
names, the model can be any tag. For word names, 
the model names are not predetermined; they are 
extracted from the training windows for the given 
head word. For instance, if a training window con- 
tains the head word followed by "of", then one of 
the features will be word_hpl_of . 



For word names, the model names are not 
allowed to be words that are tagged as nouns, verbs, 
or adjectives. These words are reserved for use in 
building the semantic features. 
Feature Values: The syntactic features are all 
binary-valued. Given a feature with a name of the 
form matchtype 4>osition_model, the feature value 
for a given window depends on whether there is a 
match of matchtype between the word in the posi- 
tion position and the model model. For instance, 
the value of tag_hpl_NNP depends on whether 
the given window has a word in the position hpl 
(head plus one) with a tag (part-of-speech tag) that 
matches NNP (proper noun). Similarly, the feature 
word_hpl_of has the value 1 {true) if the given 
window contains the head word followed by "of"; 
otherwise, it has the value (false). 

2.3 Semantic Features 

Each head word has a unique set of feature names, 
describing how the feature values are calculated. 
Feature Names: Most of the semantic features have 
names of the form position_model. The position 
names can be pre (preceding) or f ol (following). 
They refer to the nearest noun, verb, or adjective 
that precedes or follows the head word in the nine- 
word window. 

The model names are extracted from the training 
windows for the head word. For instance, if a train- 
ing window contains the word "compelling", and 
this word is the nearest noun, verb, or adjective that 
precedes the head word, then one of the features will 
be pre_compelling. 

A few of the semantic features have a different 
form of name, avg _position_sense. In names of this 
form, position can be pre (preceding) or f ol (fol- 
lowing), and sense can be any of the possible senses 
(i.e., classes, labels) of the head word. 
Feature Values: The semantic features are all 
real-valued. For feature names of the form posi- 
tion jmodel, the feature value depends on the seman- 
tic similarity between the word in position position 
and the model word model. 

The semantic similarity between two words 
is estimated by their Pointwise Mutual Informa- 
tion, PMI^i, W2), using Information Retrieval 
iTurney, 200T]|Terra and Clarke, 2003D : 

PMl(wi,w 2 ) = log 2 — — — r . 

\p{w 1 )p{w 2 ) J 

We estimate the probabilities in this equation by 
issuing queries to the Waterloo MultiText System 
flClarke et al, 19951 IClarke and Cormack, 2000j 
|Terra and Clarke, 2003| >. Laplace smoothing is 



weka. classifiers. meta.Bagging 

-W weka.classifiers.meta.MultiClassClassifier 
-W weka.classifiers.meta.Vote 

-B weka. classifiers . functions . support Vector. SMO 

-B weka.classifiers.meta.LogitBoost -W weka. classifiers. trees. DecisionStump 

-B weka.classifiers.meta.LogitBoost -W weka. classifiers. functions. SimpleLinearRegression 

-B weka. classifiers . trees . adtree . ADTree 

-B weka.classifiers.rules.JRip 

Table 1: Weka (version 3.4) commands for processing the feature vectors. 



applied to the PMI estimates, to avoid division by 
zero. 

PMI(ioi, 102) has a value of zero when the 
two words are statistically independent. A high 
positive value indicates that the two words tend 
to co-occur, and hence are likely to be seman- 
tically related. A negative value indicates that 
the presence of one of the words suggests the 
absence of the other. Past work demonstrates 
that PMI is a good estimator of semantic similari- 
ty H'lurney, 200l] |Terra and Clarke, 2003| l and that 
features based on PMI can be useful for supervised 



learning (Turney, 2003). 

The Waterloo MultiText System allows us to set 
the neighbourhood size for co-occurrence (i.e., the 
meaning of w\ A W2). In preliminary experiments 
with the ELS data from Senseval-2, we got good 
results with a neighbourhood size of 20 words. 

For instance, if w is the noun, verb, or adjec- 
tive that precedes the head word and is nearest to 
the head word in a given window, then the value 
of pre_compelling is PMI(io, compelling). If 
there is no preceding noun, verb, or adjective within 
the window, the value is set to zero. 

In names of the form avg _position_sense, the 
feature value is the average of the feature values of 
the corresponding features. For instance, the val- 
ue of avg_pre_argument_l_l 0_02 is the aver- 
age of the values of all of the prejnodel features, 
such that model was extracted from a training win- 
dow in which the head word was labeled with the 
sense argument_l_10_02. 

The idea here is that, if a testing example should 
be labeled, say, argument_l_10_02, and w\ is a 
noun, verb, or adjective that is close to the head 
word in the testing example, then PMI(u;i, W2) 
should be relatively high when W2 is extract- 
ed from a training window with the same sense, 
argument_l_10_02, but relatively low when w-i 
is extracted from a training window with a different 
sense. Thus avg_pojirion_argument_l_10_02 
is likely to be relatively high, compared to other 
avg jpositionsense features. 

All semantic features with names of the form 
position jnodel are normalized by converting them 



to percentiles. The percentiles are calculated sepa- 
rately for each feature vector; that is, each feature 
vector is normalized internally, with respect to its 
own values, not externally, with respect to the oth- 
er feature vectors. The pre features are normalized 
independently from the f ol features. The semantic 
features with names of the form avg _position_sense 
are calculated after the other features are normal- 
ized, so they do not need any further normalization. 
Preliminary experiments with the ELS data from 
Senseval-2 supported the merit of percentile nor- 
malization, which was also found useful in another 
application where features based on PMI were used 
for supervised learning ( Turney, 2Q03| >. 



2.4 Weka Configuration 

Table[2shows the commands that were used to exe- 
cute Weka (Witte n and Fran k, 1999). The default 
parameters were used for all of the classifiers. Five 
base classifiers (-B) were combined by voting. Mul- 
tiple classes were handled by treating them as mul- 
tiple two-class problems, using a 1-against-all strat- 
egy. Finally, the variance of the system was reduced 
with bagging. 

We designed the Weka configuration by eval- 
uating many different Weka base classifiers on 
the Senseval-2 ELS data, until we had identified 
five good base classifiers. We then experiment- 
ed with combining the base classifiers, using a 
variety of meta-learning algorithms. The result- 
ing system is somewhat similar to the JHU sys- 
tem, which had the best ELS scores in Senseval-2 
(Yaro wsky et al., 2001| >. The JHU system combined 
four base classifiers using a form of voting, called 
Thresholded Model Voting ( jYarowsky et al., 2001| ). 

2.5 Postprocessing 

The output of Weka includes an estimate of the 
probability for each prediction. When the head 
word is frequently labeled U (unassignable) in the 
training examples, we ignore U examples during 
training, and then, after running Weka, relabel the 
lowest probability testing examples as U. 



System 


Fine-Grained Recall 


Coarse-Grained Recall 


M OCllaCVol J OVftLClll 


7? 9% 




NRC-Fine 


69.4% 


75.9% 


NRC-Fine2 


69.1% 


75.6% 


NRC-Coarse 


NA 


75.8% 


NRC-Coarse2 


NA 


75.7% 


Median Senseval-3 System 


65.1% 


73.7% 


Most Frequent Sense 


55.2% 


64.5% 



Table 2: Comparison of NRC-Fine with other Senseval-3 ELS systems. 



3 Results 

A total of 26 teams entered 47 systems (both 
supervised and unsupervised) in the Senseval-3 
ELS task. Table |2] compares the fine-grained and 
coarse-grained scores of our four entries with other 
Senseval-3 systems. 

With NRC-Fine and NRC-Coarse, each seman- 
tic feature was scored by calculating its PMI with 
the head word, and then low scoring semantic fea- 
tures were dropped. With NRC-Fine2 and NRC- 
Coarse2, the threshold for dropping features was 
changed, so that many more features were retained. 
The Senseval-3 results suggest that it is better to 
drop more features. 

NRC-Coarse and NRC-Coarse2 were designed to 
maximize the coarse score, by training them with 
data in which the senses were relabeled by their 
coarse sense equivalence classes. The fine scores 
for these two systems are meaningless and should be 
ignored. The Senseval-3 results indicate that there 
is no advantage to relabeling. 

The NRC systems scored roughly midway 
between the best and median systems. This per- 
formance supports the hypothesis that corpus-based 
semantic features can be useful for WSD. In future 
work, we plan to design a system that combines 
corpus-based semantic features with the most effec- 
tive elements of the other Senseval-3 systems. 

For reasons of computational efficiency, we chose 
a relatively narrow window of nine-words around 
the head word. We intend to investigate whether a 
larger window would bring the system performance 
up to the level of the best Senseval-3 system. 

4 Conclusion 

This paper has sketched the NRC WSD system for 
the ELS task in Senseval-3. Due to space limita- 
tions, many details were omitted, but it is likely that 
their impact on the performance is relatively small. 

The system design is relatively straightforward 
and classical. The most innovative aspect of the sys- 
tem is the set of semantic features, which are purely 
corpus-based; no lexicon was used. 
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